# NY TAXI DATA SCIENCE FUN

### Basic Questions:
1. What are the distributions of the number of passengers per trip, payment type, fare amount, tip amount, and total amount?
2. What are the top 5 busiest hours of the day, and the top 10 busiest locations in the city?
3. What is the hourly taxi activity for each day of the week?
4. Which trip has the most consistent fares?
### Open Questions:
1. Can you predict the fare and tip amount based on the pickup / drop off location, time, and day of the week?
2. Can you predict the pickup / drop off geographical distribution for each hour of a weekday?
3. If you were a taxi owner, how would you maximize your earnings in a day?
4. If you run a taxi company, how would you maximize your earnings?

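Basic question 4 (which trip has the most consistent fares?) can be approached by binning pick-up/drop-off coordinates into grid cells and ranking routes by the standard deviation of their fares. A minimal sketch on made-up rows (the column names follow the TLC schema used below; the data here are synthetic, not from the real file):

```python
import pandas as pd

# Toy trips: two "routes" keyed by rounded pickup/dropoff coordinates.
df = pd.DataFrame({
    'pickup_longitude':  [-73.99, -73.99, -73.99, -73.95, -73.95],
    'pickup_latitude':   [ 40.73,  40.73,  40.73,  40.78,  40.78],
    'dropoff_longitude': [-73.79, -73.79, -73.79, -73.98, -73.98],
    'dropoff_latitude':  [ 40.65,  40.65,  40.65,  40.75,  40.75],
    'fare_amount':       [ 52.0,   52.0,   52.0,   8.0,    14.0],
})

# Bin coordinates to a ~1 km grid, then measure fare spread per route.
for col in ['pickup_longitude', 'pickup_latitude',
            'dropoff_longitude', 'dropoff_latitude']:
    df[col + '_bin'] = df[col].round(2)

route_std = (df.groupby(['pickup_longitude_bin', 'pickup_latitude_bin',
                         'dropoff_longitude_bin', 'dropoff_latitude_bin'])
               ['fare_amount'].std())
most_consistent = route_std.idxmin()  # route with the smallest fare spread
```

On the real data the same `groupby` runs over millions of rows, and a minimum trip count per route would be needed to avoid spurious zero-variance routes.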

In [31]:
 
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import plotly.plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly import tools
# initiate the Plotly notebook mode
init_notebook_mode()
df_big = pd.read_csv('../data/yellow_tripdata_2016-01.csv')
#df_big_clean=df_big.fillna(df_big.mean())  # optional: impute missing values with column means
df_big_clean=df_big
df=df_big_clean.loc[0:10000,:]  # use a reduced data set for testing
#df=df_big_clean                # use the whole month of data
print(df_big.shape)
print(df_big_clean.shape)
df
(2389990, 19)
(2389990, 19)
Out[31]:
[DataFrame preview: 10001 rows × 19 columns — VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, pickup_longitude, pickup_latitude, RatecodeID, store_and_fwd_flag, dropoff_longitude, dropoff_latitude, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount. Trips start from 2016-01-01 00:00:00; e.g. row 0 is a 1.10-mile trip with a $7.50 fare and an $8.80 total.]

In [64]:
 
df_big.loc[13000:13150,:]
Out[64]:
[DataFrame preview: 151 rows × 19 columns (rows 13000–13150), same schema as above.]

In [8]:
 
#help(plotly.offline.iplot)
 
## Insight 1: Passenger numbers
* Most NY Taxi trips transport solo passengers

In [9]:
 
import numpy as np
import plotly.plotly as py
#import plotly.offline as offline
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
import plotly.graph_objs as go
init_notebook_mode()
# extract the number of passengers per trip
peps_per_trip = df['passenger_count'].values
data = [go.Histogram(x=peps_per_trip)]  # the list can hold several traces
layout = go.Layout(
    title='Histogram of Passenger numbers',
    xaxis=dict(
        title='passenger number'
    ),
    yaxis=dict(
        title='Count'
    ),
    bargap=0.2,
    bargroupgap=0.1
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,  filename='People_per_trip_histogram') #this plots in online mode, limit of 50/day in community a/c
#iplot(fig,  filename='People_per_trip_histogram') #This plots when offline; no limit
High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~elmao/0 or inside your plot.ly account where it is named 'People_per_trip_histogram'
Out[9]:
 
## Insight 2: cash versus credit
* New Yorkers prefer to pay by credit card (roughly 60:40 in favour of credit card)
* Cash usage is still considerable at 40%. The cash option is a point of difference over competitor Uber.
* The distribution of fares is similar across cash and credit card payments (the median credit card fare is $1 higher than the median cash fare)
* The peak at \$52 likely represents Manhattan -> JFK airport trips, which carry a flat-rate fare of \$52 (source: Wikipedia)
* NY taxi fares are cheap (compared to Melbourne!). The median fare is around \$10.

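The \$52 peak can also be cross-checked against the rate code: in the TLC data dictionary, `RatecodeID` 2 marks the JFK flat fare. A minimal sketch on made-up rows (not the real data):

```python
import pandas as pd

# Toy rows mirroring the schema above; RatecodeID 2 is the JFK flat fare
# in the TLC data dictionary (the rows themselves are made up).
df = pd.DataFrame({
    'RatecodeID':  [1, 2, 1, 2],
    'fare_amount': [9.5, 52.0, 8.5, 52.0],
})

jfk = df[df['RatecodeID'] == 2]
jfk_share = len(jfk) / len(df)  # fraction of trips on the JFK flat rate
```

On the real data, comparing `jfk_share` against the height of the \$52 spike would confirm (or refute) the airport-trip interpretation.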
In [10]:
 
# Distribution: Payment by type
# Add histogram data
# extract fares by payment type
# 1=cc, 2=cash, 3=no charge, 4=dispute, 5=unknown, 6=voided trip
fare_paymenttype1=df.loc[df['payment_type'] == 1, 'fare_amount'].values #credit card
fare_paymenttype2=df.loc[df['payment_type'] == 2, 'fare_amount'].values #cash
#fare_paymenttype4=df.loc[df['payment_type'] == 4, 'fare_amount'].values #dispute
fare_payments=np.append(fare_paymenttype1,fare_paymenttype2)
total_paymentstype1=df.loc[df['payment_type'] == 1, 'total_amount'].values   # fare + tip + tolls
total_paymentstype2=df.loc[df['payment_type'] == 2, 'total_amount'].values   # fare + tip + tolls
tip_amountstype1=df.loc[df['payment_type'] == 1, 'tip_amount'].values
total_payments=np.append(total_paymentstype1,total_paymentstype2)
numberofCCpays=(df['payment_type'] == 1).sum()   # count of credit card payments
numberofCashpays=(df['payment_type'] == 2).sum() # count of cash payments
PcentofCCpays=np.round(numberofCCpays*100/(numberofCashpays+numberofCCpays), decimals=1)
#print(PcentofCCpays)
PcentofCashpays=np.round(numberofCashpays*100/(numberofCashpays+numberofCCpays), decimals=1)
#print(PcentofCashpays)
#print(type(fare_paymenttype2[1:10]))
# Group data together
hist_data = [fare_paymenttype1,fare_paymenttype2]
find_median1=np.median(fare_paymenttype1)
find_median2=np.median(fare_paymenttype2)
#print(find_median)
group_labels = ['Credit card', 'Cash']
# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=1.0)
fig.layout.update({'title': 'Distribution of Fares'})
fig.layout.xaxis1.update({'title': '$ amounts'})
# Plot!
#py.iplot(fig, filename='Distplot with Multiple Datasets') #online plot mode
iplot(fig, filename='Distplot with Multiple Datasets') #offline mode
from IPython.display import display, Math, Latex
display(Math(r'\text{Percentage of credit card payments is } %s \text{%%}' % PcentofCCpays))
display(Math(r'\text{Median credit payment is \$} %s ' % find_median1))
display(Math(r'\text{Percentage of cash payments is  } %s \text{%%}' % PcentofCashpays))
display(Math(r'\text{Median cash payment is \$} %s' % find_median2))
[Plot: "Distribution of Fares" — overlaid histograms and density curves for Cash and Credit card; x-axis: $ amounts]
Percentage of credit card payments is 60.8%
Median credit payment is $9.5
Percentage of cash payments is 39.2%
Median cash payment is $8.5
 
## Insight 3: fare breakdown
* Median Tip (credit card data only) is 20% of the fare

In [27]:
 
# Group data together
hist_data2 = [fare_payments,total_payments,tip_amountstype1]
group_labels2 = ['Fare', 'Total Charge', 'Tip Amount']
# Create distplot with custom bin_size
fig2 = ff.create_distplot(hist_data2, group_labels2, bin_size=[0.5,0.5,0.4])
fig2.layout.update({'title': 'Breakdown & Distribution of NY Taxi Fares'})
fig2.layout.xaxis1.update({'title': '$ amounts'})
# Plot!
#py.iplot(fig2, filename='Distplot with Multiple Datasets2') # online plot option
iplot(fig2, filename='Distplot with Multiple Datasets2') # offline plot option
find_mediantip=np.median(tip_amountstype1)
Med_tip_percentage=np.round(find_mediantip*100/find_median1, decimals=1)
display(Math(r'\text{Median tip payment (Credit card payment data only) is \$} %s ' % find_mediantip))
display(Math(r'\text{Median tip percentage (Credit card payment data only) is } %s \text{%%}' % Med_tip_percentage))
[Plot: "Breakdown & Distribution of NY Taxi Fares" — overlaid distributions for Fare, Total Charge and Tip Amount; x-axis: $ amounts]
Median tip payment (Credit card payment data only) is $1.96
Median tip percentage (Credit card payment data only) is 20.6%
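The cell above reports the median tip divided by the median fare. An alternative, sketched here on made-up rows rather than the real data, is to compute the tip percentage per trip and take the median of that, guarding against zero fares:

```python
import numpy as np
import pandas as pd

# Made-up rows; cash tips are not recorded in the data, so only
# payment_type == 1 (credit card) is used.
df = pd.DataFrame({
    'payment_type': [1, 1, 1, 2],
    'fare_amount':  [10.0, 20.0, 8.0, 9.0],
    'tip_amount':   [ 2.0,  4.0, 2.0, 0.0],
})

cc = df[df['payment_type'] == 1]
# Per-trip tip percentage; zero fares become NaN instead of dividing by zero.
tip_pct = np.where(cc['fare_amount'] > 0,
                   cc['tip_amount'] * 100 / cc['fare_amount'], np.nan)
median_tip_pct = float(np.nanmedian(tip_pct))
```

The two approaches can disagree when tips and fares are skewed differently; the per-trip median weights every trip equally.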
 
## Insight 4: pick-up and drop-off locations
* Manhattan (the central business zone) is the busiest area for taxi use
* Airports (La Guardia and JFK) feature strongly in the usage maps
    * Curiously, people are dropped off at the airports at very fixed locations, while pick-up locations are more diffuse
        * Is there a culture of people wandering out from the airport and hailing taxis wherever they can, rather than from easy-to-find taxi ranks?


* People **start taxi journeys** most frequently:
    1. in Manhattan on the **main streets**
    2. on the **main arterial routes** within residential areas (Brooklyn, Queens)
        * The *Sex And The City* imagery of hailing taxis on demand from busy streets is backed up by the data


* People **end taxi journeys** most frequently:
    1. again in Manhattan, both on and off the main streets
    2. at very **diffuse locations** across residential areas (Brooklyn, Queens, The Bronx)
        * The Bronx is a frequent drop-off location, but rarely a pick-up location
            * An effect of green "boro taxis" since 2013? (Note, however, that green taxis can also be hailed in Queens and Brooklyn, yet the Bronx pattern is notably different)

In [53]:
 
# Map the pick up locations
import pandas as pd
import matplotlib  
import matplotlib.pyplot as plt 
from matplotlib import rcParams  
df=df_big
matplotlib.pyplot.style.use('ggplot')
new_style = {'grid': False}  # grid off
matplotlib.rc('axes', **new_style)
rcParams['figure.figsize'] = (12, 12)  # size of figure
rcParams['figure.dpi'] = 250
P=df.plot(kind='scatter', x='pickup_longitude', y='pickup_latitude', color='white',
          xlim=(-74.06,-73.77), ylim=(40.61, 40.91), s=.02, alpha=.3)
P.set_facecolor('black')  # background colour (set_axis_bgcolor is deprecated)
plt.show()
In [55]:
 
# Map the drop off locations
df=df_big
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rcParams
matplotlib.pyplot.style.use('ggplot')
new_style = {'grid': False}  # grid off
matplotlib.rc('axes', **new_style)
rcParams['figure.figsize'] = (12, 12)  # size of figure
rcParams['figure.dpi'] = 250
# s is marker size; alpha is opacity
P=df.plot(kind='scatter', x='dropoff_longitude', y='dropoff_latitude', color='white',
          xlim=(-74.06,-73.77), ylim=(40.61, 40.91), s=.02, alpha=.3)
P.set_facecolor('black')  # background colour
plt.show()
In [13]:
 
# Times of the day versus average fare.
print(df.shape)
# Make new columns with the hour of the day and the day of the week
pickup_dt = pd.to_datetime(df['tpep_pickup_datetime'], format='%Y-%m-%d %H:%M:%S')
df['hour'] = pickup_dt.dt.hour
df['day'] = pickup_dt.dt.dayofweek
#find mean fare by weekday
meanfare_byhour=[] #initialise
for i in range(0,24):
    fares_byhour=df.loc[df['hour'] == i, 'fare_amount'].values #hourly fares
    meanfare_byhour.append(np.mean(fares_byhour))
    #print(i)
    #print(meanfare_byhour)
# pandas dayofweek convention: 0='Mon', 1='Tue', 2='Wed', 3='Thu', 4='Fri', 5='Sat', 6='Sun'
#find mean fare by weekday
meanfare_byweekday=[] #initialise
#print(meanfare_byweekday)
for i in range(0,7):
    fare_byweekday=df.loc[df['day'] == i, 'fare_amount'].values #weekday fares
    meanfare_byweekday.append(np.mean(fare_byweekday))
    #print(i)
    #print(meanfare_byweekday)
#plot bar chart of mean fare by weekday
data = [go.Bar(
            x=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
            y=meanfare_byweekday
    )]
layout = go.Layout(
    xaxis=dict(tickangle=-45),
    barmode='group',
    title='Mean Fare by Weekday',
    yaxis=dict(
        title='$'
    ),
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='basic-barWeekday')   
#plot bar chart of mean fare by hour of day
data2 = [go.Bar(
            x=['{}:00'.format(h) for h in range(24)],
            y=meanfare_byhour
    )]
layout2 = go.Layout(
    xaxis=dict(tickangle=-45),
    barmode='group',
    title='Mean Fare by Hour',
    yaxis=dict(
        title='$'
    ),
)
fig2 = go.Figure(data=data2, layout=layout2)
iplot(fig2, filename='basic-barHour')    
    
    
(2389990, 21)
[Bar chart: Mean Fare by Weekday ($)]
[Bar chart: Mean Fare by Hour ($)]
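The explicit loop over weekdays above can also be expressed as a single pandas groupby. A minimal sketch on synthetic stand-in data (the column names `day` and `fare_amount` match the notebook's `df`; the values here are made up):

```python
import pandas as pd

# Synthetic stand-in for the notebook's df: weekday code 0-6 plus fare per trip
df_demo = pd.DataFrame({
    'day': [0, 0, 1, 1, 6],
    'fare_amount': [10.0, 14.0, 8.0, 12.0, 20.0],
})

# One groupby call yields the mean fare per weekday, indexed by day code
meanfare_byweekday = df_demo.groupby('day')['fare_amount'].mean()
print(meanfare_byweekday.loc[0])  # 12.0 (mean of 10.0 and 14.0)
```

The resulting Series can be fed to `go.Bar` directly (`y=meanfare_byweekday.values`), avoiding the per-weekday filtering loop.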
In [30]:
 
#Top 10 busiest locations of the city
import reverse_geocoder as rg
from geopy.geocoders import Nominatim
df=df_big
#round the lat and long entries onto a 0.01-degree grid
Latitude_round=np.round(df['pickup_latitude'].values, decimals=2)+0.005   #round and recentre grid box
Longitude_round=np.round(df['pickup_longitude'].values, decimals=2)+0.005 #round and recentre grid box
#print(Latitude_round[0:5])
#print(Longitude_round[0:5])
df.loc[:,'GridcodeLat'] = pd.Series(Latitude_round, index=df.index) #add column gridcodes to df
df.loc[:,'GridcodeLon'] = pd.Series(Longitude_round, index=df.index) #add column gridcodes to df
#find 10 locations with most common grid codes
mytable = df.groupby(['GridcodeLat','GridcodeLon']).size()
mytable.sort_values(inplace=True,ascending=False)
totaltrips=mytable.sum()
print('Total trips')
print(totaltrips)
Top10BusyPickupLocations=mytable.head(10)
#print(Top10BusyPickupLocations)
#print(type(Top10BusyPickupLocations))
Top10BusyPickupLocations=Top10BusyPickupLocations.to_frame()
#print(Top10BusyPickupLocations)
#print(type(Top10BusyPickupLocations))
#coordinates = (51.5214588,-0.1729636),(9.936033, 76.259952),(37.38605,-122.08385)
coordinates = Top10BusyPickupLocations.index.values.tolist()
#print(coordinates)
#type(coordinates)
#GridcodeLat GridcodeLon
import gmplot
#gmap = gmplot.from_geocode("New York") #didn't work; set the view manually instead
gmap = gmplot.GoogleMapPlotter(40.75, -73.9, 16) #manual map view: center_lat, center_lng, zoom
#scatter the top-10 busiest pickup grid centres onto the map
top10_lats = [lat for lat, lng in coordinates]
top10_lngs = [lng for lat, lng in coordinates]
gmap.scatter(top10_lats, top10_lngs, 'cornflowerblue', size=40, marker=False)
gmap.draw("mymap.html")
Total trips
2389990
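The grid-code trick above (round to two decimals, then add 0.005) snaps every coordinate to the centre of a roughly 0.01-degree box, so nearby pickups share one grid code. A small sketch on made-up latitudes:

```python
import numpy as np

# Rounding to 2 decimals snaps coordinates onto a ~0.01-degree grid;
# adding 0.005 shifts the code from the box corner to its centre.
lats = np.array([40.7513, 40.7528, 40.7591])
grid = np.round(lats, decimals=2) + 0.005
print(grid)  # [40.755 40.755 40.765] -> first two points fall in the same box
```

Grouping on these codes (as the `groupby(['GridcodeLat','GridcodeLon']).size()` above does) then counts trips per box.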
In [24]:
 
help(gmplot.GoogleMapPlotter)
Help on class GoogleMapPlotter in module gmplot.gmplot:

class GoogleMapPlotter(builtins.object)
 |  Methods defined here:
 |  
 |  __init__(self, center_lat, center_lng, zoom)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  circle(self, lat, lng, radius, color=None, c=None, **kwargs)
 |  
 |  draw(self, htmlfile)
 |      # create the html file which include one google map and all points and
 |      # paths
 |  
 |  get_cycle(self, lat, lng, rad)
 |  
 |  grid(self, slat, elat, latin, slng, elng, lngin)
 |  
 |  heatmap(self, lats, lngs, threshold=10, radius=10, gradient=None, opacity=0.6, dissipating=True)
 |      :param lats: list of latitudes
 |      :param lngs: list of longitudes
 |      :param threshold:
 |      :param radius: The hardest param. Example (string):
 |      :return:
 |  
 |  marker(self, lat, lng, color='#FF0000', c=None, title='no implementation')
 |  
 |  plot(self, lats, lngs, color=None, c=None, **kwargs)
 |  
 |  polygon(self, lats, lngs, color=None, c=None, **kwargs)
 |  
 |  scatter(self, lats, lngs, color=None, size=None, marker=True, c=None, s=None, **kwargs)
 |  
 |  write_grids(self, f)
 |  
 |  write_heatmap(self, f)
 |  
 |  write_map(self, f)
 |      # TODO: Add support for mapTypeId: google.maps.MapTypeId.SATELLITE
 |  
 |  write_paths(self, f)
 |  
 |  write_point(self, f, lat, lon, color, title)
 |  
 |  write_points(self, f)
 |  
 |  write_polygon(self, f, path, settings)
 |  
 |  write_polyline(self, f, path, settings)
 |  
 |  write_shapes(self, f)
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  from_geocode(location_string, zoom=13) from builtins.type
 |  
 |  geocode(location_string) from builtins.type
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

In [ ]:
 
# find addresses of the co-ordinates; a bit awkward, so Google Maps is used instead
results = rg.search(coordinates) # default mode = 2, reverse geocode from lat and long to address
print(results)
geolocator = Nominatim(user_agent='ny-taxi-notebook') # newer geopy versions require a user_agent
for coord in coordinates: # only len(coordinates) entries exist, so iterate the list itself
    try:
        location = geolocator.reverse(coord)
        PlaceNames = location.address.split(",")
        # print the borough-level parts of the address, if present
        print(PlaceNames[-8:-5])
    except Exception:
        print(['Unknown', 'Unknown', 'Unknown'])
#next: plot a table or pie chart of these locations
In [52]:
 
#Top10BusyPickupLocations['GridcodeLat','GridcodeLon'].values
Top10BusyPickupLocations.index.values
coordinates[2]
Out[52]:
(40.765, -73.97500000000001)
In [ ]:
 
#plot pie chart of Top 10 busiest locations (placeholder shares for now)
# Add graph data
trace1={'labels': ['1st', '2nd', '3rd', '4th', '5th'],
        'values': [38, 27, 18, 10, 7],
        'type': 'pie',
        'name': 'Busiest pickup locations',
        'marker': {'colors': ['rgb(56, 75, 126)',
                              'rgb(18, 36, 37)',
                              'rgb(34, 53, 101)',
                              'rgb(36, 55, 57)',
                              'rgb(6, 4, 4)']},
            'domain': {'x': [0, 1],
                       'y': [.4, 1]},
            'hoverinfo':'label+percent+name',
            'textinfo':'none'
        }
# Build the figure from the trace (figure was previously undefined in this cell)
figure = go.Figure(data=[trace1])
# Update the margins to add a title and see graph x-labels.
figure.layout.margin.update({'t':75, 'l':50})
figure.layout.update({'title': 'Top 10 Busiest Pickup Locations'})
# Update the height to leave room for the pie chart
figure.layout.update({'height':800})
# Plot!
iplot(figure)
In [ ]:
 
#classify trips into Manhattan, JFK airport, LaGuardia
#Q: what percentage of trips are airport trips?
#map the fare disputes / scrap, as there are not many of these
#find out the % of trips paid by credit card versus cash
#insights: lots of drop-offs to Brooklyn, Queens, the Bronx, but fewer pick-ups from those areas.
#People take taxis home rather than to work? time of day? weekend?
#And people seem to get picked up from main streets (the Sex and the City iconography of hailing a cab holds true!)
#interesting in the era of Uber
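The credit-card-versus-cash question in the notes above can be answered with a single `value_counts`. A minimal sketch on a hypothetical mini-sample of `payment_type` codes (1 = credit card, 2 = cash, per the data dictionary):

```python
import pandas as pd

# Hypothetical mini-sample of payment_type codes (1=credit card, 2=cash)
payments = pd.Series([1, 1, 2, 1, 2])

# value_counts(normalize=True) returns each code's share of all trips
shares = payments.value_counts(normalize=True)
print(shares.loc[1])  # 0.6 -> 60% of this sample paid by credit card
```

On the real data this would be `df['payment_type'].value_counts(normalize=True)`.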
In [90]:
 
#plot Distribution: Passenger numbers per trip
import numpy as np
import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.figure_factory as ff
import plotly.graph_objs as go
#create_distplot needs a 1-D array with no NaNs/infs, so ravel and drop missing values first
peps_per_trip = df['passenger_count'].dropna().values.ravel()
print(peps_per_trip)
hist_data = [peps_per_trip]
group_labels = ['distplot']
fig = ff.create_distplot(hist_data, group_labels)
fig['layout'].update(title='Distribution: Passenger numbers per trip')
py.iplot(fig, filename='DistplotPepsPerTrip')
[2 5 1 ..., 1 1 1]
(In the original run, the 2-D passenger_count array was passed straight into create_distplot,
whose KDE step raised ValueError: array must not contain infs or NaNs.)
In [65]:
 
#import plotly
#plotly.tools.set_credentials_file(username='eosg', api_key='AmlsmkQM0FkVbEPtlQSf')
#plotly.tools.set_credentials_file(username='elmao', api_key='8z69RhuTfVA7EdkIEtXZ')
 
## If you run a taxi company, how would you maximize your earnings?
Uber is a major market disrupter in the taxi space.  To maximize taxi company earnings, concurrent analysis of Uber versus taxi data is necessary.
Thoughts: on cold NY winter mornings (or in the rain!), does Uber now take a big share of the historical taxi market? (Pick-up direct from the door, rather than walking to a major route to hail a taxi.)
* UberT has entered this market gap (a yellow taxi can be requested to your door through the Uber app)

In [6]:
 
#basic Histograms
#extract number of people per trip
peps_per_trip_df=df.loc[:, df.columns.str.match('passenger_count')]
peps_per_trip_df.shape
print(type(peps_per_trip_df))
peps_per_trip=df.loc[:, df.columns.str.match('passenger_count')].values
print(type(peps_per_trip))
fare_paymenttype1=df.loc[df['payment_type'] == 1, 'fare_amount'].values
fare_paymenttype2=df.loc[df['payment_type'] == 2, 'fare_amount'].values
fare_paymenttype4=df.loc[df['payment_type'] == 4, 'fare_amount'].values
type(fare_paymenttype1)
#payment_type: 1=credit card, 2=cash, 3=no charge, 4=dispute, 5=unknown, 6=voided trip
#RateCodeID (final rate code at end of trip): 1=standard rate, 2=JFK, 3=Newark, 4=Nassau or Westchester, 5=negotiated fare, 6=group ride
<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>
Out[6]:
numpy.ndarray
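For readable tables and chart legends, the numeric `payment_type` codes listed above can be mapped to labels with `Series.map`. A small sketch (the `payment_labels` dict simply transcribes the mapping from the comment):

```python
import pandas as pd

# Label mapping taken from the payment_type data dictionary above
payment_labels = {1: 'credit card', 2: 'cash', 3: 'no charge',
                  4: 'dispute', 5: 'unknown', 6: 'voided trip'}

codes = pd.Series([1, 2, 4])
print(codes.map(payment_labels).tolist())  # ['credit card', 'cash', 'dispute']
```

On the full data, `df['payment_type'].map(payment_labels)` gives a labelled column suitable for grouping and plotting.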
In [ ]:
 